Measuring the suicidal mind: The ‘open source’ Suicidality Scale, for adolescents and adults

Clinicians are expected to provide accurate and useful mental health assessments, sometimes in emergency settings. The most urgent challenge may be in calculating suicide risk. Unfortunately, existing instruments often fail to meet requirements. To address this situation, we used a sustainable scale development approach to create a publicly available Suicidality Scale (SS). Following a critical review of current measures, community input, and panel discussions, an international item pool survey included 5,115 English-speaking participants aged 13–82 years. Revisions were tested with two follow-up cross-sectional surveys (Ns = 814 and 626). Pool items and SS versions were critically examined through item response theory, hierarchical cluster, factor and bifactor analyses, resulting in a unidimensional eight-item scale. Psychometric properties were high (loadings > .77; discrimination > 2.2; test-retest r = .87; internal consistency, ω = .96). Invariance checks were satisfied for age, gender, ethnicity, rural/urban residence, first language, self-reported psychiatric diagnosis and suicide attempt history. The SS showed stronger psychometric properties, and significant differences in bivariate associations with depressive symptoms, compared with included suicide measures. The ‘open source’ Suicidality Scale represents a significant step forward in accurate assessment for people aged 13+, and diverse populations. This study provides an example of sustainable scale development utilizing community input, emphasis on strong psychometric evidence from diverse samples, and a free-to-use license allowing instrument revisions. These methods can be used to develop a wide variety of psychosocial instruments that can benefit clinicians, researchers, and the public.


Introduction
Suicide resides at the very core of deaths by despair [1,2]. Due to this importance, there are long-standing recommendations for clinicians to conduct routine suicide risk assessments [SRA; 3,4]. However, low-validity SRAs can lead to poorly guided clinical decisions. As with other psychosocial constructs, quantifying the latent trait, suicidality, requires high instrument precision with a focus on the fundamental nature of the construct. Despite serious consequences, those selecting and using tests may not be giving sufficient attention to psychological science, particularly psychometrics [5][6][7][8][9][10][11]. A lack of focus on psychometric validity, and concerns over psychological science replication [12], has resulted in continued use of popular measures, regardless of demonstrated weaknesses.
In response to current assessment practices, a growing number of psychological scientists are advocating for greater emphasis on measurement validity over consistency [e.g., 13, 14]. That may be particularly relevant for SRAs, which have not notably improved since Beck and colleagues published the Scale for Suicide Ideation [SSI; 15] in 1979. To address the urgent need for accurate assessments, this study utilized a sustainable scale development approach for the Creative Commons licensed (free culture) Suicidality Scale (SS) for adolescents, adults, and diverse populations.
We hypothesize that a highly valid measure of the latent trait, current suicidality, may be the best candidate for predicting future suicidal distress and suicide. To measure a latent trait, we first need to define it and determine how it can be quantified. Many find the term suicidality useful as it encompasses the totality of the multifaceted suicidal mind. Decades of evidence and theory reveal a complicated dynamic of affective, cognitive, and behavioral attributes that are volatile but can also pose long-term risk [15][16][17]. We consider suicidality as the extant summation of one's feelings, thoughts and behaviors related to taking one's life. These facets require strong empirical evidence if they are to form a highly accurate measure.

Measurement models
To understand the current underwhelming state of SRAs we can look to the overwhelming popularity of classical test theory (CTT). There are various measurement models to consider when validating a latent trait instrument. The parallel model stipulates all items are equal in measuring the same trait with the same level of accuracy, identical response sets, and identical error [18,19]. Similarly, CTT assumes a tau-equivalent model, which is identical to the parallel model except that item errors may vary. The congeneric model, in contrast, assumes items measure the same latent trait but can vary in precision, response sets, and error. Also of importance, sum scores (summing item scores for a scale total) require tau-equivalence: all items and response steps must be equally and uniformly quantifiable [20,21]. Popular psychometric analyses such as Cronbach's alpha, confirmatory factor analysis (CFA), receiver operating characteristics (ROC) and area under the curve (AUC), assume tau-equivalence [22,23]. ROC also requires true binary outcomes [24]. The congeneric model, however, fits best with decades of evidence of a lack of SRA tau-equivalence, demonstrated through heterogeneity in factor loadings, discrimination, information functions, etc. [25][26][27][28]. Psychometric analyses consistent with congeneric models include factor analysis (FA), bifactor analysis (BA), McDonald's omega, and item response theory (IRT).
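The practical difference between tau-equivalent and congeneric assumptions can be illustrated numerically. The following Python sketch is illustrative only and is not the R code used in this study. The loadings are hypothetical, items are assumed to be standardized with uncorrelated errors under a one-factor model, and the function names are invented for the example. It computes Cronbach's alpha from the model-implied correlation matrix and McDonald's omega from the loadings.

```python
def implied_correlations(loadings):
    """Model-implied correlation matrix for a one-factor model (standardized items)."""
    k = len(loadings)
    return [[1.0 if i == j else loadings[i] * loadings[j] for j in range(k)]
            for i in range(k)]


def cronbach_alpha(matrix):
    """Alpha from a covariance (here correlation) matrix."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(k))
    return k / (k - 1) * (1.0 - diagonal / total)


def mcdonald_omega(loadings):
    """Omega total for a one-factor model with standardized loadings."""
    common = sum(loadings) ** 2
    unique = sum(1.0 - l * l for l in loadings)
    return common / (common + unique)


# Hypothetical loadings: equal (tau-equivalent) versus unequal (congeneric).
tau_equivalent = [0.80, 0.80, 0.80, 0.80]
congeneric = [0.95, 0.85, 0.70, 0.50]

alpha_tau = cronbach_alpha(implied_correlations(tau_equivalent))
omega_tau = mcdonald_omega(tau_equivalent)
alpha_con = cronbach_alpha(implied_correlations(congeneric))
omega_con = mcdonald_omega(congeneric)

print(f"tau-equivalent: alpha = {alpha_tau:.3f}, omega = {omega_tau:.3f}")
print(f"congeneric:     alpha = {alpha_con:.3f}, omega = {omega_con:.3f}")
```

Under these assumptions the two coefficients coincide when loadings are equal, whereas alpha falls below omega when loadings differ, which is one reason omega is often preferred for congeneric scales.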
Another fundamental decision is defining a measurement model as reflective or formative. Most psychological measures are reflective: highly correlated items that are indirect indicators of common factors. However, with formative measures, items may be loosely correlated but are required components of a composite factor [29]. A classic example of a formative factor is socioeconomic status, which can be derived through items on income, education, and other components. Many popular SRAs are formative, constructed through indexes, that is, checklists of items that can be scored present/absent. The SAD PERSONS and Manchester Self-Harm Rule are indexes that are hypothesized to form cumulative values of suicide risk [30]. Many hospitals, clinicians and researchers use SRAs that include variables such as sex (males scored at-risk), relationship status (unpartnered scored at-risk), and major depressive disorder diagnosis (scored at-risk). Additionally, the DSM-V mood disorders group was reportedly developing an SRA index based on presence/absence of suicide attempts, plans, substance abuse, and living alone [31]. The implicit measurement hypothesis of SRA indexes is that population suicide risk and protective factors can be counted, and that through simple addition and subtraction a highly accurate personal risk score can be calculated.
Despite decades of psychometric evidence and advances in statistical software, SRA validation studies have mostly used CTT methods, typically have not held instruments to high standards, and often included dichotomous items and outcomes. False dichotomies, for example, can violate core test assumptions, leading to false findings [32][33][34]. Nevertheless, WHO's Composite International Diagnostic Interview [WMH-CIDI; 35] was tested using sum scores and ROC and AUC analyses, with disputable binary outcomes [36]. Many CIDI dichotomous items were integrated into the Columbia-Suicide Severity Rating Scale [C-SSRS; 37]. More recently, a moderate-sized computer-adaptive test study (CAT; N = 308) used 11 dichotomous and ordinal items, based partly on the C-SSRS, with factor loadings as low as .59 described as 'strong' [28]. The authors used sum scores and cutoffs to create opaque risk groups, with substantial trait overlap. These studies are consistent with most SRA validation efforts: measurement models are not specified, analyses may not be justified, and instruments fail to demonstrate strong validity.
The SSI, a decades-old standard for SRAs, was developed over years through modifying various instruments, resulting in 19 items with differing three-point response sets [15]. Developers used relevant psychological constructs (e.g., suicidal ideation, the wish to die) rather than demographics, pointing out that demographics can indicate group risk differences but are not appropriate for individual risk assessments. Numerous studies have examined SSI validity, but these have not led to significant improvements. For example, an IRT analysis [25] seemed to err on the side of consistency by concluding only two items should be revised or deleted. However, findings showed additional items with low discrimination (low ability in differentiating trait levels). Despite limitations, the SSI remains popular and is part of many test banks, including the PhenX Toolkit for genetic studies [38]. This has allowed for consistency between assessments, but at a cost to validity.
In addition to choosing the right model, there are other measurement details to consider. Pek and Flora [39] identified measurement problems as metrics (response sets, response options), and the construct (the closeness/distance between the observed scores and the hypothesized latent trait). To address such concerns, measurement validation should focus on the processes that produce changes in the instrument's values [5]. For example, we could examine whether suicide attempt items can be adequately quantified through yes/no responses or whether a polytomous item including intent to die provides additional monotonic (increasing/decreasing) grades [40,41]. That is, focusing on the underlying facets of an item that produce measurable changes when individuals move from low to higher risk.

Study aims
The primary goal was to determine a unidimensional item set that would best capture the suicidal attributes that are most valid across diverse populations and ages. Suicidality was conceptualized as a latent trait composed of several interrelated facets. We chose a congeneric model [10,20,21], as relevant facets may be dimensional and are likely to vary on trait coverage and information captured. To achieve study aims, we employed theory-informed but evidence-driven scale development practices [e.g., 42-45]. Scale validity required evidence of strong model fit, high item discrimination and information levels, high predictive ability, and invariance by demographic groupings.
In addition, this study was aimed at providing an example of sustainable scale development, in support of the UN's sustainable development goals [46]. For latent trait measures to be sustainable, they require very strong psychometrics and limited error. We aimed for community input, including review and suggestions on item wording. We also aimed for validity across diverse groups, as sustainable scales require demonstrated utility across demographics. Sustainable scales also need to be free to use and modifiable, so that low-income populations can use the instrument and future research can improve measurement accuracy.

Transparency and openness
We utilized a multidisciplinary open science approach to help achieve our goals of contributing to sustainable development of good health and wellbeing, knowledge and skills sharing, and global partnerships. That includes making data, methods and analyses publicly available and making the SS freely available through a Creative Commons BY 4.0 license [47]. A preprint of an earlier version of this manuscript produced feedback, resulting in several modifications [48]. An open methods presentation provides additional information on study methods [49].
Survey participation was open to anyone meeting minimum age requirements with adequate language skills. All questions were voluntary, other than a mandatory minimum age gateway item. Forced-choice questions were not used as they can lead to higher dropout and lower data quality [50,51]. Permission to use C-SSRS scales was obtained from the copyright holders. Ethics approval was obtained through the first author's host university ethics committees (S1, 2017001069, H0016220; S2, H19153; S3, H20149), and studies were in accord with the World Medical Association's Declaration of Helsinki [52]. Participants indicated consent by clicking on an 'agree to participate' button after reading a study information statement. With ethics committee approval, parental consent was not required for Study 1 (S1) participants aged 13+ years or S2 participants aged 14+ years, but was required for S3 participants aged 14-17. Adolescents were a target group for these studies to help validate the SS across a broad age span. Our open science approach includes youth participation rights [53,54]. However, no student research-credit participants were included, or incentives offered, due to data validity concerns [55,56]. Analyses were conducted with the open-source statistical environment R, v.4.1.1, Kick Things [57]. R code and data are available at: https://osf.io/vjxnq/.

Procedure
S1 included the selection of suicide pool items and scales, review and revisions of items, data collection, and psychometric analyses. A multidisciplinary panel (N = 12) selected pool items, reviewed candidate items, evaluated linguistic and cultural validity, conducted the studies and evaluated results. Panelists came from several countries, backgrounds, and disciplines such as psychology, medicine, education, and genetics. After determining an item pool, we piloted test items with community members, asking for feedback on clarity and content. Results led to several wording changes.
Next, identical online surveys were conducted for S1 in English (N = 5,115) and Chinese (N = 2,988). Description and findings of the Chinese language study are extensive and presented elsewhere [58]. S2 (N = 814) included SS modifications and a time-two (T2) two-week follow-up (n = 190). S3 (N = 626) tested additional revisions.
Three sequential cross-sectional surveys obtained participants through social media advertisements (e.g., Instagram, Facebook) and snowballing. Surveys were promoted for approximately 2-14 weeks. Researcher-funded advertising totaled < US$3,000. Participants first read information statements, then indicated their consent to answer questions on suicide and other topics. We utilized an anonymous online platform, as anonymity can improve response accuracy on stigmatized topics [59][60][61][62]. However, to obtain T2 participants, we requested email addresses for sending a survey link; the addresses were deleted after two invitations. Surveys included open comments, provided contacts for freely available international support services, and took about 10-15 minutes to complete. To obtain a sample representing the full suicidality spectrum, S1 was promoted as a study on suicide, a method that has resulted in high participation rates by suicidal people [e.g., 63]. We used progress bars and a simple but attractive format to improve response rates [64]. Pool items were randomized to limit order bias, with the exception of items within depression scales. Demographic items were presented last to limit social desirability bias.

Measures and factor analysis
Surveys included measures of psychopathology and positive factors. We expected the SS to show strong positive correlations with psychopathology/risk factors and negative associations with protective factors. Unless otherwise indicated, we used discrete visual analogue responses (e.g., 1 = very unlikely, 2, 3, 4, 5 = very likely), as typical Likert-type responses may be less likely to show equivalent response steps [65].
All scales were examined for factor structure, unidimensional model fit, and internal consistency (Table 1). We conducted minimum residual FA (direct oblimin rotation) with the psych package, utilizing a mixed tetrachoric and polychoric correlation matrix when accommodating dichotomous and ordered-categorical items [66]. This method provides an unweighted least squares solution, which is more robust to skewed distributions [67]. Comrey and Lee [68] considered factor loadings ≥ .71 (sharing > 50% common variance) as 'excellent.' Similarly, communalities (h²) ≥ .60 indicate a strong representation of the factor structure [69]. In addition, the Tucker Lewis Index of factorability (TLI) and root mean square error of approximation (RMSEA) are provided as indicators of model fit. High model fit (e.g., TLI) should be near 1.0, while error should be close to 0, but we do not apply cutoff score interpretations [70,71]. We used the coefficientalpha package [72] to calculate robust ω, with bootstrapped 95% CIs, as a recommended estimate of internal consistency for congeneric scales [73,74].
S1 included the Satisfaction With Life Scale [75], a five-item measure of global satisfaction with life. The Patient Health Questionnaire-8/9 [76,77] assessed participants' somatic and non-somatic depressive symptoms and contributed one pool item (Dead). The Depression Anxiety Stress Scales [78] included two seven-item scales assessing past-week non-somatic symptoms of depression (DASS-D), and somatic and non-somatic anxiety symptoms (DASS-A), on four-point response sets. The depression scale contributed one pool item (Meaning).
S2 included the Multidimensional Scale of Perceived Social Support's [79] two four-item subscales assessing perceived social support from family and friends. S3 included three freely available Patient-Reported Outcomes Measurement Information System® (PROMIS®) scales [45]. We modified PROMIS® scales by converting Likert-type to discrete visual analogue responses. Emotional Support v2.0-6a [80] is a six-item measure of current feelings of being emotionally supported and valued. The Emotional Distress Depression Scale v1.0-8a [81] consisted of eight items measuring past-week non-somatic depressive symptoms. The PROMIS-A v1.0-8a [81] measured participants' non-somatic anxiety symptoms.

SRAs.
All studies included the SABCS [27]. Legacy SABCS includes six items assessing affective, behavioral, and cognitive suicidal attributes. As recommended [40], the behaviors item was expanded to three items: Ideation-lifetime, Plan-intent, and Attempt-intent, which were graded by intent to die. Panel discussions and community feedback led to the following modifications. Wish to live (WTL) and wish to die (WTD) timeframes were changed from 'right now' to 'recently' to include a longer but contemporary affective state. Similarly, Debate was changed from 'ever' to 'in the past year.' We calculated a modified version (SABCS-m) for scale comparisons, which included: Debate, Ideation-year, WTD, WTL (reverse-scored), Predict (prediction of future suicide attempts), and Attempts-intent.
Two C-SSRS scales [37] were included for the item pool and scale comparisons. The C-SSRS self-report screener (C-SSRS-10) includes ten yes/no items, and has received several notable endorsements [e.g., 82]. Ideation-binary has been used as a gateway item. Those responding 'yes' complete all items; those responding 'no' only complete Sleep and Plan-binary items. We asked participants to complete all items. The five-item suicidal ideation intensity scale (C-SSRS-5) is scored on six-point Guttman response sets, differing for each item. Wording was taken directly from the clinical scale with minimal modifications for self-report.
Pool items & selection (S1). Over 200 items from over 50 SRAs were reviewed for inclusion. Most instruments overlapped with identical or similar items on cognition (suicidal thoughts), behaviors (suicide plans and attempts), and, less often, affect (desire to live/die). In contrast to most SRA studies, we included items on the internal suicidal debate (Debate, RFD). There were many minor wording differences, often on timeframe (e.g., past 7 days, lifetime), synonyms (e.g., kill yourself, end your life), and response options. Many used dichotomous responses, some used ordered behavioral frequencies (e.g., once a week, 2-5 times a week). Most used Likert or Guttman-type ordered-categorical responses. We considered psychometric properties and popularity, aiming for diversity among validated suicidal facets. Item selection was informed by theory, such as Shneidman's [16] commonalities of suicide and the suicidal barometer model [27]. We included three popular single-item (SI) SRAs: the PHQ-9's Dead; the BDI-II's Ideation-BDI; and the Hamilton Depression Rating Scale's [83] Wish-HAMD, which is similar to the Quick Inventory of Depressive Symptomatology SI [84]. These single-item SRAs have been used in numerous studies and clinical settings [e.g., 85-87]. It is also noteworthy that several items owe their roots to the SSI and other early instruments but have been modified. In addition to wording changes, we sometimes expanded response sets as evidence shows 4-7 points are usually ideal [88,89]. Ultimately, S1 included 30 pool items (Appendix A in S1 File).
Note. TLI = Tucker-Lewis Index; RMSEA = root mean square error of approximation; V = common variance; h² = communalities; ω = internal consistency, 95% CI (bootstrapped 1000 iterations); SWLS = Satisfaction with Life Scale; PHQ = Patient Health Questionnaire; DASS = Depression Anxiety Stress Scales.

Analyses
Analyses followed expert advice which proposes that multiple fit indices are useful for improving measurement models, but that cutoffs should not be used to accept or reject models [e.g., 9,70,90]. We did not use CFA due to unmet tau equivalence assumptions, and as FA and BA are more suitable for identifying the true underlying structure [23,91]. Rest-score plots examined monotonicity and linearity [45,92]. We used the psych package [66] for hierarchical cluster analysis (CA), FA, and BA. IRT analyses used the ltm package [93].
CA indicates the ideal number of clusters and item loadings. It includes an estimate of model fit and error (root mean square residuals, RMSR). In addition, CA analyses provide a graphic illustrating cluster hierarchies to examine item associations.
We conducted BA using Schmid-Leiman oblique rotations [94]. BA includes general factor item loadings and communalities comparable to FA, and additional common variance unique to item grouping factors [66,95]. In addition, BA provides explained common variance (ECV), an indicator of unidimensionality, McDonald's ω_h as an estimate of common latent trait variance, and model error (RMSEA) [73,74]. We examined both general and group factors; however, for scale diagnostics, we focus on general factor statistics as our aim was to identify core latent trait attributes. Group factor trends are presented for discussion.
With IRT, the latent trait is quantified as theta, with scores typically ranging from -4.0 to 4.0. Higher values indicate higher trait levels. Analyses provide item discrimination/slope (a) and information functions (IF), which inform us of the item's ability to discriminate individuals on latent trait levels, and how much information they provide, respectively. IRT also provides item response category cutpoints (b), and graphics illustrating IF and b values, helping to identify problems in monotonicity, number of item responses, uninformative items, and total test information. We determined the graded response model [GRM; 96] fit data best as it allows for variance in item discrimination and response formats, if responses are graded (increasing/decreasing or dichotomous).
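To make the roles of discrimination (a) and thresholds (b) concrete, the following Python sketch evaluates the category probabilities and item information of a single graded response model item. It is a minimal illustration under stated assumptions and is not the ltm code used in this study: the item parameters are hypothetical, the logistic form of Samejima's model is assumed, and the function names are invented for the example.

```python
import math


def cumulative_probs(theta, a, thresholds):
    """P(X >= k | theta) for each category boundary, padded with 1 and 0."""
    inner = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    return [1.0] + inner + [0.0]


def category_probs(theta, a, thresholds):
    """Probability of each response category (thresholds must be increasing)."""
    cum = cumulative_probs(theta, a, thresholds)
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def item_information(theta, a, thresholds):
    """Fisher information contributed by one graded item at a given theta."""
    cum = cumulative_probs(theta, a, thresholds)
    # Derivative of each cumulative curve; it is zero at the fixed boundaries.
    slopes = [a * p * (1.0 - p) for p in cum]
    info = 0.0
    for k in range(len(cum) - 1):
        p_k = cum[k] - cum[k + 1]
        if p_k > 0.0:
            info += (slopes[k] - slopes[k + 1]) ** 2 / p_k
    return info


# Hypothetical five-category item with four increasing thresholds.
thresholds = [-0.5, 0.5, 1.5, 2.5]
probs = category_probs(1.0, 2.5, thresholds)
info_high_a = item_information(1.0, 2.5, thresholds)
info_low_a = item_information(1.0, 1.0, thresholds)

print([round(p, 3) for p in probs])
print(round(info_high_a, 3), round(info_low_a, 3))
```

With the same thresholds, the item with the steeper slope yields more information near theta = 1.0, which is the sense in which high discrimination marks a more informative item.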
We also calculated empirical Bayes estimates of individual ability. Ability scores are GRM-derived theta values based on individual item characteristics and unique scale response patterns. These were compared with traditional sum scores.
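The contrast between ability scores and sum scores can be sketched as follows. This Python example is an assumption-laden illustration, not the study's procedure: it uses an expected a posteriori (EAP) estimate with a standard normal prior on a fixed grid, three hypothetical items, and invented function names. It shows that two response patterns with the same sum score can map to different theta estimates when items differ in discrimination and thresholds.

```python
import math


def category_probs(theta, a, thresholds):
    """Graded response model category probabilities (logistic form)."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def eap_theta(responses, items, lo=-4.0, hi=4.0, n_points=81):
    """Posterior mean of theta given a response pattern and item parameters."""
    step = (hi - lo) / (n_points - 1)
    numerator = denominator = 0.0
    for i in range(n_points):
        theta = lo + i * step
        weight = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for response, (a, thresholds) in zip(responses, items):
            weight *= category_probs(theta, a, thresholds)[response]
        numerator += theta * weight
        denominator += weight
    return numerator / denominator


# Hypothetical items: (discrimination, increasing thresholds), responses scored 0-3.
items = [
    (3.0, [-0.5, 0.5, 1.5]),
    (1.2, [-1.0, 0.0, 1.0]),
    (2.0, [0.5, 1.2, 2.0]),
]

pattern_a = [2, 1, 0]
pattern_b = [0, 1, 2]
theta_a = eap_theta(pattern_a, items)
theta_b = eap_theta(pattern_b, items)

print(sum(pattern_a), round(theta_a, 2))
print(sum(pattern_b), round(theta_b, 2))
```

Both patterns sum to 3, yet the estimated trait levels differ because the model weights each response by the item that produced it.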
We assessed test-retest relative reliability through Pearson's r, and absolute reliability via intraclass correlation coefficient (ICC(3,1); two-way mixed model, absolute agreement, single measure). Younger ages (e.g., aged 13 only, 13-15) were examined for unique response patterns through item/scale diagnostics [97]. We also checked item and test invariance by demographics (e.g., age, first language) and clinical factors (self-reported psychiatric diagnosis, lifetime suicide attempts) through differential item functioning (DIF) and differential test functioning (DTF). This approach has demonstrated superiority over CFA and other invariance tests [98]. We used the lordif package [99], which conducts an iterative hybrid ordinal logistic regression based on GRM modeling to detect DIF through R² change (≥ .02), which is preferable to using sum scores and is robust to non-normally distributed data.
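The distinction between relative and absolute reliability can be illustrated with a short Python sketch. It is not the study's R code: the scores are hypothetical, the function names are invented, and the ICC shown is the single-measure, absolute-agreement form computed from two-way ANOVA mean squares, which is an assumption about how the coefficient described above would be calculated. When every retest score shifts by a constant, Pearson's r remains perfect while the absolute-agreement ICC drops.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def icc_absolute_single(x, y):
    """Single-measure, absolute-agreement ICC for two occasions."""
    n, k = len(x), 2
    rows = list(zip(x, y))
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(x) / n, sum(y) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k / n * (ms_cols - ms_err))


# Hypothetical ability scores; every T2 score is shifted upward by 0.5.
t1 = [-1.2, -0.4, 0.1, 0.9, 1.6]
t2 = [score + 0.5 for score in t1]

r = pearson_r(t1, t2)
icc = icc_absolute_single(t1, t2)
print(round(r, 3), round(icc, 3))
```

Reporting both coefficients therefore separates stability of rank order from agreement in the scores themselves.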

Data treatment
Data cleansing involved identification and treatment of missing values, univariate and multivariate outliers, and inauthentic responses [100][101][102][103]. We considered missingness, Mahalanobis' distance scores, and used the careless package [104] to identify psychometric antonyms and long strings, to guide removal on a case-by-case basis. Item pool missing values totaled 7.5%, S2 missing = 5.2%, S3 = 10.4%. Missing values were replaced through expectation-maximization, a recommended single-imputation method [105,106]. Gender and ethnicity were dichotomized (male/female sex, Euro-Caucasian/other) for some analyses. We conducted bootstrapping (1,000 iterations) to better approximate population statistics and correct for deviations from normal distributions [107,108].
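As one concrete example of an inauthentic-response indicator, the long-string index counts the longest run of identical consecutive answers. The Python sketch below is a simplified illustration and does not reproduce the careless package; the respondent data, the review threshold, and the function name are hypothetical. Consistent with case-by-case review, it only flags cases for inspection rather than removing them.

```python
def longest_string(responses):
    """Length of the longest run of identical consecutive responses."""
    longest = current = 0
    previous = object()  # sentinel that never equals a real response
    for value in responses:
        current = current + 1 if value == previous else 1
        longest = max(longest, current)
        previous = value
    return longest


# Hypothetical respondents answering ten five-point items.
respondents = {
    "case_1": [1, 2, 2, 3, 1, 4, 2, 3, 3, 1],
    "case_2": [5, 5, 5, 5, 5, 5, 5, 5, 2, 1],
}

REVIEW_THRESHOLD = 7  # hypothetical value, chosen only for this example
flagged = [case for case, answers in respondents.items()
           if longest_string(answers) >= REVIEW_THRESHOLD]
print(flagged)
```

A flag of this kind would normally be weighed alongside other indicators, such as missingness and multivariate outlier scores, before any decision about a case.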

Item selection
Analyses began by testing for unidimensionality with the item pool [109], followed by reducing items to a parsimonious set maximizing latent trait information [110]. CA showed pool items reasonably formed a single but complex cluster, fit = .96, RMSR = .07. FA results indicated a single factor explaining 66% common variance, TLI = .47, RMSEA = .27. BA results were more ambiguous, showing a moderately strong general factor, and three weak to moderate group factors, ω_h = .83. We compared unconstrained GRM (items may vary in discrimination/slope levels) vs. constrained (items discriminate equally on theta). ANOVA results showed the unconstrained model fit best, with less information loss, ΔAIC = 5,013, p < .001. We therefore conducted unconstrained GRM. Table 2 shows pool diagnostics, revealing weak to very strong items.

PLOS ONE
The suicidality scale
We repeated analyses, removing the worst fitting item one by one. When an item is removed, theta is redefined by remaining items as we refine the model. Behavior items (self-harm, attempts, plans) were among the weakest and were removed early. We found 16 items with strong psychometrics. Analyses were repeated with subsamples (e.g., aged 13 [n = 355], aged 13-15 [n = 1917], aged 40+ [n = 224], native English vs. non). Some items, such as Plan-ever, Ideation-lifetime and Ideation-control, were removed due to weaknesses with multiple groups. An 11-item set included four items that were valuable but with shortcomings: Ideation-BDI, Save, Ideation-times, and Meaning. Ideation-BDI showed weaknesses with youth and monotonicity. Save showed comparatively lower performance overall and with youth, and some linearity issues with non-native English speakers and non-Euro-Caucasians, so was removed. We compared two similar items: Ideation-year and Ideation-times. They differ by the subjective 'often' vs. specific frequency (e.g., 2-5 times a week). Ideation-times showed comparative weakness with older participants, while Ideation-year showed slightly higher discrimination for the full sample (2.84 vs. 2.63) and was therefore selected as the stronger item. We retained Meaning as it performed well overall, only showing slightly lower properties with more extreme age groups. We also thought it might benefit from rewording and expanded response points.

Suicidality Scale psychometrics
We found eight items provided a highly informative measure across the full sample and subsamples, forming the Suicidality Scale. Table 3 shows high but variable item discrimination and information functions, supporting decisions to treat items as non-uniform indicators of suicidal attributes. We also see important variations in item abilities to discriminate at the lowest and highest trait levels. Ideation and Debate captured more information at low levels, while RFD and DKS did so at high suicidality levels. Fig 1 illustrates the GRM output in Table 3. The breadth of item thresholds (b values) indicates theta coverage. The area under each item's curve indicates the amount of information captured on the latent trait. Fig 2 shows the SS hierarchical cluster pathways and BA group and general factor associations. Note that the algorithm attempts to determine three meaningful group factors [109]. However, only one group factor with loadings ≥ .20 was identified in S1. Figs 3 and 4 show S2-S3 cluster and BA diagrams, respectively. Item group associations may help us understand the nature of the latent trait. Across the three studies, we see weak to moderate evidence of two subgroups, more evident in S3.
Note. b_l = lower item threshold, b_u = upper, a = discrimination, IF = information function, Clus = hierarchical cluster loading, FA = minimum residual factor analysis, BA = bifactor analysis, L = common factor loading, h² = communality, g = general factor. https://doi.org/10.1371/journal.pone.0282009.t003

In S3, we see the upper end of theta is not well defined, likely due to the smaller sample size and fewer participants at higher suicidality levels. Table 4 presents SS model fit statistics for all studies, demonstrating a strong, if imperfect, measure with high fit, low error, and high internal consistency.

Differential item & test functioning
We next performed DIF and DTF checks to determine whether item or test scores differ by group membership, which would result in biased assessment. Grouping variables included 2-3 categories: age (A = 13-18, 19+; B = 13-15, 16-19, 20+; C = 13-39, 40+); gender (A = male/female/nonbinary+; B = male/female); region = urban/town/rural; ethnicity = Euro-Caucasian/other; first language = English/other; psychiatric diagnosis yes/no; suicide attempts yes/no. No SS items, or test total, showed differential functioning for any grouping (ΔR² < .02). When examining the best 11 items (including Save, Ideation-BDI, Ideation-times) we found some evidence of DTF by age and psychiatric diagnosis, indicating that including one or more of those items results in discrepant inter-group evaluations. The lack of DIF or DTF for participants with or without a lifetime suicide attempt informs us that there was no meaningful difference in trait assessment due to attempt status. S2 and S3 DIF checks revealed no evidence of non-invariance by age group, ethnicity, gender, urban/rural residence, or between South Africans (S3; n = 141) and others.

Predictive ability, test-retest reliability
In S2, T2 (two weeks) examined temporal stability of the SS and evidence of short-term predictive ability. An ANCOVA (controlling for sex, age, ethnicity) comparing participants who completed T2 (n = 190) with those who did not showed no statistically significant group difference in SS T1 ability scores, F(1, 809) = 0.39, p = .53, partial η² = .00. Partial correlations (controlling for demographics) compared T1 with T2 ability scores (derived from T2 data only), showing high temporal stability, r = .87.

SS revisions
Sustainable scale development includes testing modifications with the aim of making incremental improvements when warranted (see Appendix B in S1 File for revisions, Appendix C in S1 File for final SS). S1 included legacy PHQ-9 and DASS-D items that met criteria for inclusion in the SS but showed weaknesses. For S2, we increased responses from four to five, removed non-anchor labels, and reworded for clarity and consistency. Notably, we revised Dead to remove the double-barreled format. We kept 'better off dead' and deleted 'hurting yourself.' 'Better off dead' is more directly relevant to suicidality, and evidence shows self-harming is a separate factor from suicidality [e.g., 111]. Also, our analyses showed Self-harm was the least valid pool item. For Meaning, we added 'your' to make the statement 'life is meaningless' more personal, as suicidality is most relevant to the self [e.g., 16,112]. For Debate, we used past year for S1 and lifetime for S2. Given slightly lower psychometric properties in S2, lifetime may be too long for that item. We also used the subjective term 'recently' for some items, including Dead. 'Recently' appeared to work well, based on item statistics. Tables 6 and 7 show all items maintained strong psychometric properties across studies.
Item response characteristic curves assist in checking monotonicity and response set validity. Fig 6 shows, for S1, all item responses were appropriately aligned on theta, with no apparent violations of monotonicity. However, for the Dead and Debate items, the second-highest options were not well-supported, indicating revised response sets or other adjustments may be helpful. In addition, WTD showed seven points may be too many as the fifth option was under-endorsed. These variations in item response characteristics, including different locations on theta (b values) for specific item responses, are further evidence against tau-equivalence. In S3, WTD diagnostics were strong with five response points. In S3, five points also appeared to be too many for Predict; however, that item captures relatively more information at high theta levels and S3 had fewer highly suicidal participants.

Ability scores and SS associations
We next examined associations between SS ability scores and psychosocial variables, including available SRAs, controlling for demographics. Table 8 shows correlations were in expected directions, positive with psychopathology and negative with protective factors. Note that the SS shared three items with the SABCS-m, and single items with the DASS-D (Meaning) and PHQ-9 (Dead) in S1, which were later revised.

Note (Table 6). Clus = hierarchical cluster analysis, FA = minimum residual factor analysis, BA = exploratory bifactor analysis (Schmid-Leiman), L = common factor loading, g = general factor loading. https://doi.org/10.1371/journal.pone.0282009.t006

We next tested the question: does the measure matter? We compared correlations between industry-standard C-SSRS sum scores and depression sum scores (CTT method) with correlations between SS ability and depression ability scores in S1. We avoided autocorrelation by using the PHQ-8 and the DASS-D-6 (removing Meaning), and statistically controlled for age, sex and ethnicity. We then tested the CTT hypothesis that higher sum scores necessarily indicate higher levels of the latent trait. Results did not support the hypothesis, as we saw notable overlap in ability scores for specific sum scores. For example, an SS sum score of 21 (ability range = -0.49 to -0.08) includes those with theta lower than some cases with a sum of 18 (range = -0.97 to -0.40), and higher than some with a sum of 24 (range = -0.12 to 0.17).
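The overlap described in this example can be checked directly. The short sketch below uses only the three ability ranges reported in the text; the dictionary and function names are illustrative assumptions, not part of the study's analysis code.

```python
# Ability (theta) ranges reported in the text for three SS sum scores.
theta_range = {
    18: (-0.97, -0.40),
    21: (-0.49, -0.08),
    24: (-0.12, 0.17),
}

def overlaps(a, b):
    """True when two closed intervals (lo, hi) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

for low_sum, high_sum in [(18, 21), (21, 24), (18, 24)]:
    shared = overlaps(theta_range[low_sum], theta_range[high_sum])
    print(f"sum {low_sum} vs sum {high_sum}: overlap = {shared}")
```

Under these reported ranges, a sum of 21 overlaps with both neighbouring sums, so a higher sum score does not guarantee a higher ability estimate.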

Discussion
This project was aimed at demonstrating sustainable scale development through validating a more precise measure of the latent trait suicidality. Through consecutive studies and revisions, the eight-item Suicidality Scale demonstrated high psychometric properties by capturing facets most relevant to the construct. Tests showed the SS performed well across several demographic groupings, and by mental disorder and suicide attempt history (self-reported yes/no). The strength of these findings across diverse samples and groups provides strong evidence that the SS measures common suicidality characteristics, fulfilling the core requirement of scale validity: it measures what it is supposed to measure. It is notable, but not surprising, that no dichotomous items demonstrated sufficient validity for inclusion in the final scale. Behavior items also showed weaknesses compared with affective and cognitive items. These findings extend the SABCS study [27], which used IRT and FA to determine a valid measure. Those authors, however, allowed theory to rationalize retaining a moderately valid behavior item. Our findings provide further evidence that suicidal behaviors are meaningful facets of suicidality but that behavior items, in numerous variations, have not demonstrated sufficient validity for accurate risk assessment. Additionally, we found no support for including a self-harming item. That area benefits from unique construct-specific research [e.g., 113]. We also found no DIF by suicide attempt history, indicating the underlying trait can be assessed equivalently regardless of attempt status. This is also evidence against the hypothesis that attempt status alone can define risk. Nevertheless, behavior items remain important for biographic data and clinical evaluations.

Mapping the suicidal mind
Hierarchical cluster and bifactor analyses revealed two possible four-item groups. GRM-derived item threshold statistics show one set appeared best at capturing information at the lowest assessed suicidal levels: Ideation, Meaning, Debate, and Dead. These findings indicate that early or low suicidality may be characterized by infrequent thoughts of suicide, with some thoughts revolving around an active suicidal debate. Feeling that one's life has no meaning provides an affective element. The evidence here confirms that suicidality is more than simple behaviors or thoughts; it includes an internal struggle between choosing life or death [114][115][116]. Evans and Farberow [117] presented life/death ambivalence as possibly the most important aspect of the suicidal mind. With follow-up study, we may be able to verify such early, or lower-risk, phases.
We also saw some consistency in items capturing information at the highest theta levels: RFD, DKS, WTD, and Predict. RFD, like Debate, is directly related to life/death ambivalence [16] and suicidal debate theory [114]. DKS and WTD provide affective suicidal facets. It may be that focus on the finality of one's decision, to kill self, to die, is highly relevant to suicidal people at their greatest risk of performing such behaviors [118]. 'Predict' adds further weight to the idea of concluding that debate with action. With precise SRAs, we may improve our understanding of how the suicidal mind develops and sometimes transitions into high-risk behaviors.

Clinical decisions and cutoff scores
The greatest challenge for SRAs may be in translating assessments into appropriate clinical directions. Clinical decisions are often ordered, ranging from no treatment needed to emergency care. Many scales include attractive cutoff scores (e.g., SSI, C-SSRS) for low to high risk. However, those cutoffs were established through highly questionable ROC and AUC analyses. For SRAs, such cutoffs are based on three disproven hypotheses: 1) all items are equal in quantifying suicidality; 2) responses of all items are equally graded; 3) binary outcomes (e.g., suicide attempt vs. no attempt, high suicidality vs. low suicidality) are true dichotomies. Our research replicates previous studies demonstrating a lack of SRA tau-equivalence through FA [15,111] and IRT analyses [25,27]. Therefore, the predictor variable, the SRA, cannot produce valid cutoff scores, as sums include items and response steps of unequal weights and increments. As we saw here, SS sum scores of 21 can represent a range of latent trait levels.
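The point about unequal item weights can be illustrated with a minimal sketch of a graded response model (GRM) in which two response patterns share the same sum score but yield different ability estimates. Everything specific here is an assumption for illustration: the two items, their discrimination and threshold values, the grid-based expected a posteriori (EAP) estimator with a standard normal prior, and the function names. None of these are the SS parameter estimates or the estimation routine used in the study.

```python
import math

def grm_category_probs(theta, a, b):
    """Graded response model: probability of each response category.

    a is the item discrimination; b is the ascending list of thresholds.
    """
    # Cumulative probabilities P(X >= k), padded with 1 (k = 0) and 0 (k = K).
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]

def eap_theta(responses, items, lo=-4.0, hi=4.0, steps=161):
    """Expected a posteriori theta on a grid with a standard normal prior."""
    num = den = 0.0
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        weight = math.exp(-0.5 * theta ** 2)  # unnormalized N(0, 1) prior
        for resp, (a, b) in zip(responses, items):
            weight *= grm_category_probs(theta, a, b)[resp]
        num += theta * weight
        den += weight
    return num / den

# Hypothetical item parameters (discrimination, thresholds); not SS estimates.
items = [
    (3.0, [0.0, 0.8, 1.6]),   # highly discriminating item
    (1.2, [-0.5, 0.5, 1.5]),  # weakly discriminating item
]

pattern_a = [3, 0]  # sum score 3, driven by the high-discrimination item
pattern_b = [0, 3]  # sum score 3, driven by the low-discrimination item
theta_a = eap_theta(pattern_a, items)
theta_b = eap_theta(pattern_b, items)
print(f"sum = {sum(pattern_a)}: theta = {theta_a:.2f}")
print(f"sum = {sum(pattern_b)}: theta = {theta_b:.2f}")
```

In this sketch the pattern endorsing the more discriminating item receives the higher ability estimate, even though both patterns have identical sums, which is why a single sum-score cutoff can group people at different latent trait levels.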
Given the lower validity of sum scores compared with ability scores, and the lack of validity of SRA cutoff scores, how can clinicians use SRAs? Our evidence shows individuals with minimum scores, or slightly above that, are currently at a non- or low-suicidal level and may be treated as such. Individuals with highest or near-highest scores evidence high suicidality/risk and should be treated accordingly. For those scoring in between these extremes, it is not yet possible to determine valid risk groupings. We used S1 data, due to the large volume and diversity on theta, and the suicidal barometer model [27] to help illustrate the suicidal mind (Fig 7). In contrast to CTT-derived cutoff score protocols, and consistent with PROMIS recommendations [80], we propose scores be used to guide but not dictate clinical decisions.

Limitations and future sustainable scale development
We made efforts to establish valid datasets; however, no treatment of outliers, inauthentic data and missing values can yield perfectly authentic data. Carefully considering these factors resulted in significant improvements over alternatives, such as deleting all cases with missing values or ignoring inauthentic responses [100,101,103]. We used cross-sectional convenience samples, which are not ideal but can be as representative of study factor associations, and thus generalizable, as large representative samples [119]. S2 and S3 were moderately sized but included fewer participants at high suicidality levels, limiting our ability to draw clear conclusions on some model and item characteristics. While studies included sufficient youth samples, we had fewer participants aged 60+. That may be partly due to the online platform. Regardless, further study is required to validate assessment with older ages.
Validating any SRA requires testing prediction abilities. We included a time-two sample (two weeks); however, larger samples over longer periods are required to examine temporal consistency and prediction. To provide more valid tests of SRA predictive abilities, we also require improved measurement of outcomes. We hypothesize that polytomous representations of suicidal outcomes (e.g., suicides, suicide attempts) would provide more information and greater validity than current dichotomous taxa. Several studies have shown suicide attempt status (yes/no) is a demonstrably false dichotomy, as there are degrees of risk within and overlapping taxa [15,40,41,120]. Expanding not-fit-for-purpose categorizations, including cause of death, to a limited continuum can be accomplished through assessing variations in intent to die. For example, Tabachnick [121] and Shneidman [115] promoted the concept of subintentional death, a death that may be due to non-suicidal causes but in which the decedent was experiencing suicidal symptoms and knowingly put themselves at risk. That and other approaches could help improve the validity of outcome variables for testing SRA predictive abilities.
One of our most important aims was testing the validity of adolescent assessments. Our results fit with previous findings showing children over 12 years are capable of completing self-report psychological assessments [122]. We saw no evidence that younger ages answered discrete visual analogue scales differently than others, and there was no DIF with those aged 13-18. There were no adverse incidents reported, and many adolescents left positive comments regarding their study participation. These findings demonstrate suitability for including youth participants in ethics-approved studies with SRAs, without parental consent. Their volunteer contributions should inspire more efforts to include young people in citizen-science-type projects.
There has never been any doubt that valid SRAs can be useful in genetics research, CAT, ecological momentary assessments, etc. Employing low-validity measures, however, provides no real benefits with such advanced and potentially groundbreaking approaches. We envision a near future where mental health checkups include CAT using highly validated instruments. These assessments can highlight personal attributes on a network of mental health factors (e.g., depression, suicidality, emotional stability). That information may be combined with neuroimaging techniques, producing more comprehensive psychobiological mental health reports [e.g., 123]. In addition, network analysis has demonstrated unique abilities in describing complex mental health patterns [124]. That, together with precise measurement, may help further elucidate the suicidal mind, leading to more insightful work with neuroimaging and genetics. Such a symbiotic mesh of highly valid latent trait and biological evaluation has potential for providing as accurate a picture of mental health as we have for physical health.
In this study, we attempted to conduct scale development according to evidence-based practices, using appropriate measurement models and critically evaluating findings [5,6,20]. Sustainable scale development also includes best practices in all areas of psychological science, as well as community involvement. To improve research and clinical practices, we join others in providing publicly available measures [11]. We chose a Creative Commons CC BY 4.0 license for the SS to encourage collaboration and incremental improvements. The SS manual also has a CC BY 4.0 license and will be updated in response to future developments, including SS versions in Chinese, Spanish, etc. [125]. We welcome the suggestions from Kirtley and colleagues [126] on open science in suicidology. We hope such efforts will encourage evidence-based and critical analytic approaches with large datasets, community involvement, and the use of free and open clinical and research instruments.

Conclusions
For decades, suicide risk assessments have been consistently poor to mediocre. To address this long-standing limitation, we chose a sustainable evidence-driven method to produce a valid and reliable measure: the Suicidality Scale 1.0. It is not perfect. It is, however, a step forward. The SS showed stronger psychometric properties than three comparison scales and demonstrated validity across diverse samples and groups. Using more precise measurement will help elucidate latent traits and refine our psychobiological models. Creating more accurate and sustainable instruments should also translate into improved epidemiology, clinical decisions, and prevention of deaths by despair. If we are to make meaningful inroads into solving the great psychological problems of our times, instrument consistency cannot be allowed to trump measurement validity.